06. Parallel Crawler
Step 4. Coding the Parallel Web Crawler
Luckily for you, the forward-thinking author of the legacy web crawler created an extensible design that uses a dependency injection framework. They defined an interface WebCrawler; the legacy code exists in the concrete implementation SequentialWebCrawler. Definitely not the worst legacy code you've ever seen! All you have to do is create a second implementation, ParallelWebCrawler, that has the same functionality as the legacy crawler, but takes advantage of multi-core architectures.
Your task is to fill in the src/main/java/com/udacity/webcrawler/ParallelWebCrawler.java file. Specifically, the crawl method:
public CrawlResult crawl(List<String> startingUrls)
The input is a list of URLs where the crawl should start downloading and parsing HTML. For this part, you can reuse the legacy code that downloads and parses web pages. @Inject a com.udacity.webcrawler.parser.PageParserFactory, and use it like this:
PageParser.Result result = parserFactory.get(url).parse();
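If it helps to see where that call lives, here is a trimmed-down sketch of the class. The real starter file already declares the constructor with more injected parameters (including the ForkJoinPool mentioned below), the @Inject import is assumed to be javax.inject (match whatever the legacy crawler imports), and downloadAndParse is a hypothetical helper, not part of the skeleton:

package com.udacity.webcrawler;

import java.util.List;

import javax.inject.Inject;

import com.udacity.webcrawler.parser.PageParser;
import com.udacity.webcrawler.parser.PageParserFactory;

final class ParallelWebCrawler implements WebCrawler {

  private final PageParserFactory parserFactory;

  @Inject
  ParallelWebCrawler(PageParserFactory parserFactory /* , ...the other injected arguments */) {
    this.parserFactory = parserFactory;
  }

  @Override
  public CrawlResult crawl(List<String> startingUrls) {
    // TODO: the parallel crawl goes here; every page visit boils down to one call
    // like the one in downloadAndParse(...) below.
    throw new UnsupportedOperationException("not implemented yet");
  }

  // Hypothetical helper: downloads the page at "url" and parses its word counts and links.
  private PageParser.Result downloadAndParse(String url) {
    return parserFactory.get(url).parse();
  }
}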
The implementation of ParallelWebCrawler must actually run in parallel. A ForkJoinPool has already been created for you in the ParallelWebCrawler constructor.
Recall that ForkJoinPool is a kind of Executor that is optimized for efficient processing when most work tasks create other subtasks. In other words, it is ideal for building recursive algorithms that run in parallel. In fact, the most common kinds of subtasks are called RecursiveAction and RecursiveTask. Take a look at the legacy implementation (src/main/java/com/udacity/webcrawler/SequentialWebCrawler.java), and you will notice that it also uses a recursive algorithm, because the crawlInternal() method invokes itself. Think about ways to take this recursive algorithm and use ForkJoinPool to make it parallel.
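If ForkJoinPool is new to you, here is a tiny, self-contained refresher that has nothing to do with the crawler: a RecursiveTask that sums an array by splitting itself into subtasks. The class name and threshold are made up for the example:

import java.util.concurrent.RecursiveTask;

final class SumTask extends RecursiveTask<Long> {
  private final int[] values;
  private final int start;
  private final int end;

  SumTask(int[] values, int start, int end) {
    this.values = values;
    this.start = start;
    this.end = end;
  }

  @Override
  protected Long compute() {
    if (end - start <= 1_000) {              // small enough: just do the work
      long sum = 0;
      for (int i = start; i < end; i++) {
        sum += values[i];
      }
      return sum;
    }
    int mid = (start + end) / 2;             // otherwise: split into two subtasks
    SumTask left = new SumTask(values, start, mid);
    SumTask right = new SumTask(values, mid, end);
    left.fork();                             // run the left half in the pool
    return right.compute() + left.join();    // compute the right half, then wait for the left
  }
}

// Usage: long total = java.util.concurrent.ForkJoinPool.commonPool()
//     .invoke(new SumTask(data, 0, data.length));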
There are multiple correct ways to do this. For starters, you will want to create your own custom task class, which should be a subclass of RecursiveAction or RecursiveTask. Either choice is fine, but depending on your decision the final code will be slightly different. Each custom task will download and process URLs in a separate thread. Remember how ForkJoinPool works by having its subtasks create more subtasks? To do that, your custom task class will create other subtasks and run them using ForkJoinTask's invoke() or static invokeAll() methods.
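To make that concrete, here is one possible shape for such a task, written as a RecursiveAction. Treat every name in it as an assumption: CrawlInternalTask, its fields, and the stopping conditions simply mirror the parameters of the legacy crawlInternal() method, and the sketch assumes the parser result exposes getWordCounts() and getLinks() the same way the legacy crawler uses them. A RecursiveTask-based design would look slightly different.

import java.time.Clock;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.RecursiveAction;
import java.util.regex.Pattern;

import com.udacity.webcrawler.parser.PageParser;
import com.udacity.webcrawler.parser.PageParserFactory;

// Sketch of a custom ForkJoinPool task; all names are illustrative.
final class CrawlInternalTask extends RecursiveAction {
  private final String url;
  private final Instant deadline;
  private final int maxDepth;
  private final Map<String, Integer> counts;   // shared; should be a concurrent map
  private final Set<String> visitedUrls;       // shared; should be a concurrent set
  private final Clock clock;
  private final List<Pattern> ignoredUrls;
  private final PageParserFactory parserFactory;

  CrawlInternalTask(String url, Instant deadline, int maxDepth,
                    Map<String, Integer> counts, Set<String> visitedUrls,
                    Clock clock, List<Pattern> ignoredUrls,
                    PageParserFactory parserFactory) {
    this.url = url;
    this.deadline = deadline;
    this.maxDepth = maxDepth;
    this.counts = counts;
    this.visitedUrls = visitedUrls;
    this.clock = clock;
    this.ignoredUrls = ignoredUrls;
    this.parserFactory = parserFactory;
  }

  @Override
  protected void compute() {
    // Same stopping conditions the legacy crawlInternal() checks.
    if (maxDepth == 0 || clock.instant().isAfter(deadline)) {
      return;
    }
    for (Pattern pattern : ignoredUrls) {
      if (pattern.matcher(url).matches()) {
        return;
      }
    }
    // Atomic check-and-record so the same URL is never processed twice.
    if (!visitedUrls.add(url)) {
      return;
    }
    PageParser.Result result = parserFactory.get(url).parse();
    result.getWordCounts().forEach((word, count) ->
        counts.merge(word, count, Integer::sum));
    // Create one subtask per link and run them all in the same pool.
    List<CrawlInternalTask> subtasks = new ArrayList<>();
    for (String link : result.getLinks()) {
      subtasks.add(new CrawlInternalTask(link, deadline, maxDepth - 1,
          counts, visitedUrls, clock, ignoredUrls, parserFactory));
    }
    invokeAll(subtasks);
  }
}

A crawl(...) implementation could then submit one such task per starting URL through the pool (for example with the pool's invoke method), wait for them all to finish, and assemble the CrawlResult from counts and visitedUrls; how exactly you wire that together is up to you.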
If it makes sense for the way you structure your implementation, you should consider applying the factory pattern to create these subtasks. You should aim to minimize the presence of large constructor calls with long lists of parameters; the builder pattern can help with this.
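For example, a small Builder nested in the task class keeps those calls readable, and a factory could likewise hold the arguments that never change between subtasks (the clock, the parser factory, the ignored-URL patterns). The sketch below trims the task down to three fields purely to show the pattern; the names are again hypothetical:

import java.time.Instant;

// Illustrative only: a trimmed-down task holder with a nested Builder.
final class CrawlTaskExample {
  private final String url;
  private final Instant deadline;
  private final int maxDepth;

  private CrawlTaskExample(String url, Instant deadline, int maxDepth) {
    this.url = url;
    this.deadline = deadline;
    this.maxDepth = maxDepth;
  }

  static final class Builder {
    private String url;
    private Instant deadline;
    private int maxDepth;

    Builder setUrl(String url) { this.url = url; return this; }
    Builder setDeadline(Instant deadline) { this.deadline = deadline; return this; }
    Builder setMaxDepth(int maxDepth) { this.maxDepth = maxDepth; return this; }

    CrawlTaskExample build() { return new CrawlTaskExample(url, deadline, maxDepth); }
  }
}

// Usage: CrawlTaskExample task = new CrawlTaskExample.Builder()
//     .setUrl("https://example.com").setDeadline(deadline).setMaxDepth(10).build();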
Remember that in a parallel program, you can have multiple threads accessing data structures at the same time. Because of this, you should use one or more concurrent data structures to compute the output. Keep the following tips in mind:
- For the urlsVisited output, the same URL should not be counted twice in the same crawl. Note: if a URL returns an HTTP error (such as a 400 error), the crawler still considers that as a "visit".
- When computing the wordCounts, the crawler should not accidentally count the results from the same page twice. Remember that the crawler will be downloading and processing multiple web pages at the same time, so this could be tricky!
- Utilities like Collections.synchronizedCollection and java.util.concurrent will be helpful, but when using these data structures, think carefully about which methods are and are not atomic, and what guarantees those methods provide (see the sketch after this list).
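To illustrate that last point, the sketch below contrasts a non-atomic check-then-act sequence with the single calls that java.util.concurrent collections do make atomic; CrawlState and its method names are hypothetical:

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper holding the crawl's shared state.
final class CrawlState {
  private final Set<String> visitedUrls = ConcurrentHashMap.newKeySet();
  private final Map<String, Integer> counts = new ConcurrentHashMap<>();

  boolean markVisited(String url) {
    // NOT safe: another thread could add the URL between the two calls below,
    // so the same URL could be counted twice.
    //   if (!visitedUrls.contains(url)) { visitedUrls.add(url); return true; }
    //   return false;

    // Safe: add() checks and inserts in one atomic step and reports whether
    // this thread was the first to record the URL.
    return visitedUrls.add(url);
  }

  void addWordCount(String word, int count) {
    // Safe: merge() performs the read-modify-write atomically, so concurrent
    // updates to the same word are never lost.
    counts.merge(word, count, Integer::sum);
  }
}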
Running Tests
As you work, you can run the provided unit tests on your parallel web crawler implementation:
mvn test -Dtest=WebCrawlerTest,ParallelWebCrawlerTest
These tests do not thoroughly test that the crawler uses correct synchronization, but they will provide a useful signal as to whether ParallelWebCrawler provides the same functionality as the legacy code. You will need to look over your implementation to ensure it uses correct synchronization.